Goto

Collaborating Authors

 neural network accelerator


Co-design of a novel CMOS highly parallel, low-power, multi-chip neural network accelerator

arXiv.org Artificial Intelligence

Why do security cameras, sensors, and siri use cloud servers instead of on-board computation? The lack of very-low-power, high-performance chips greatly limits the ability to field untethered edge devices. We present the NV-1, a new low-power ASIC AI processor that greatly accelerates parallel processing (> 10X) with dramatic reduction in energy consumption (> 100X), via many parallel combined processor-memory units, i.e., a drastically non-von-Neumann architecture, allowing very large numbers of independent processing streams without bottlenecks due to typical monolithic memory. The current initial prototype fab arises from a successful co-development effort between algorithm- and software-driven architectural design and VLSI design realities. An innovative communication protocol minimizes power usage, and data transport costs among nodes were vastly reduced by eliminating the address bus, through local target address matching. Throughout the development process, the software and architecture teams were able to innovate alongside the circuit design team's implementation effort. A digital twin of the proposed hardware was developed early on to ensure that the technical implementation met the architectural specifications, and indeed the predicted performance metrics have now been thoroughly verified in real hardware test data. The resulting device is currently being used in a fielded edge sensor application; additional proofs of principle are in progress demonstrating the proof on the ground of this new real-world extremely low-power high-performance ASIC device.


Pixel Embedding: Fully Quantized Convolutional Neural Network with Differentiable Lookup Table

arXiv.org Artificial Intelligence

By quantizing network weights and activations to low bitwidth, we can obtain hardware-friendly and energy-efficient networks. However, existing quantization techniques utilizing the straight-through estimator and piecewise constant functions face the issue of how to represent originally high-bit input data with low-bit values. To fully quantize deep neural networks, we propose pixel embedding, which replaces each float-valued input pixel with a vector of quantized values by using a lookup table. The lookup table or low-bit representation of pixels is differentiable and trainable by backpropagation. Such replacement of inputs with vectors is similar to word embedding in the natural language processing field. Experiments on ImageNet and CIFAR-100 show that pixel embedding reduces the top-5 error gap caused by quantizing the floating points at the first layer to only 1% for the ImageNet dataset, and the top-1 error gap caused by quantizing first and last layers to slightly over 1% for the CIFAR-100 dataset. The usefulness of pixel embedding is further demonstrated by inference time measurements, which demonstrate over 1.7 times speedup compared to floating point precision first layer.


To PiM or Not to PiM

Communications of the ACM

A 20nm 6GB function-in-memory DRAM, based on HBM2 with a 1.2 TFLOPS programmable computing unit using bank-level parallelism, for machine learning applications.


PulseDL-II: A System-on-Chip Neural Network Accelerator for Timing and Energy Extraction of Nuclear Detector Signals

arXiv.org Artificial Intelligence

Front-end electronics equipped with high-speed digitizers are being used and proposed for future nuclear detectors. Recent literature reveals that deep learning models, especially one-dimensional convolutional neural networks, are promising when dealing with digital signals from nuclear detectors. Simulations and experiments demonstrate the satisfactory accuracy and additional benefits of neural networks in this area. However, specific hardware accelerating such models for online operations still needs to be studied. In this work, we introduce PulseDL-II, a system-on-chip (SoC) specially designed for applications of event feature (time, energy, etc.) extraction from pulses with deep learning. Based on the previous version, PulseDL-II incorporates a RISC CPU into the system structure for better functional flexibility and integrity. The neural network accelerator in the SoC adopts a three-level (arithmetic unit, processing element, neural network) hierarchical architecture and facilitates parameter optimization of the digital design. Furthermore, we devise a quantization scheme compatible with deep learning frameworks (e.g., TensorFlow) within a selected subset of layer types. We validate the correct operations of PulseDL-II on field programmable gate arrays (FPGA) alone and with an experimental setup comprising a direct digital synthesis (DDS) and analog-to-digital converters (ADC). The proposed system achieved 60 ps time resolution and 0.40% energy resolution at signal to noise ratio (SNR) of 47.4 dB.


GANDSE: Generative Adversarial Network based Design Space Exploration for Neural Network Accelerator Design

arXiv.org Artificial Intelligence

With the popularity of deep learning, the hardware implementation platform of deep learning has received increasing interest. Unlike the general purpose devices, e.g., CPU, or GPU, where the deep learning algorithms are executed at the software level, neural network hardware accelerators directly execute the algorithms to achieve higher both energy efficiency and performance improvements. However, as the deep learning algorithms evolve frequently, the engineering effort and cost of designing the hardware accelerators are greatly increased. To improve the design quality while saving the cost, design automation for neural network accelerators was proposed, where design space exploration algorithms are used to automatically search the optimized accelerator design within a design space. Nevertheless, the increasing complexity of the neural network accelerators brings the increasing dimensions to the design space. As a result, the previous design space exploration algorithms are no longer effective enough to find an optimized design. In this work, we propose a neural network accelerator design automation framework named GANDSE, where we rethink the problem of design space exploration, and propose a novel approach based on the generative adversarial network (GAN) to support an optimized exploration for high dimension large design space. The experiments show that GANDSE is able to find the more optimized designs in negligible time compared with approaches including multilayer perceptron and deep reinforcement learning.


PhotoFourier: A Photonic Joint Transform Correlator-Based Neural Network Accelerator

arXiv.org Artificial Intelligence

The last few years have seen a lot of work to address the challenge of low-latency and high-throughput convolutional neural network inference. Integrated photonics has the potential to dramatically accelerate neural networks because of its low-latency nature. Combined with the concept of Joint Transform Correlator (JTC), the computationally expensive convolution functions can be computed instantaneously (time of flight of light) with almost no cost. This 'free' convolution computation provides the theoretical basis of the proposed PhotoFourier JTC-based CNN accelerator. PhotoFourier addresses a myriad of challenges posed by on-chip photonic computing in the Fourier domain including 1D lenses and high-cost optoelectronic conversions. The proposed PhotoFourier accelerator achieves more than 28X better energy-delay product compared to state-of-art photonic neural network accelerators.


Neural network accelerator for quantum control

arXiv.org Artificial Intelligence

Efficient quantum control is necessary for practical quantum computing implementations with current technologies. Conventional algorithms for determining optimal control parameters are computationally expensive, largely excluding them from use outside of the simulation. Existing hardware solutions structured as lookup tables are imprecise and costly. By designing a machine learning model to approximate the results of traditional tools, a more efficient method can be produced. Such a model can then be synthesized into a hardware accelerator for use in quantum systems. In this study, we demonstrate a machine learning algorithm for predicting optimal pulse parameters. This algorithm is lightweight enough to fit on a low-resource FPGA and perform inference with a latency of 175 ns and pipeline interval of 5 ns with $~>~$0.99 gate fidelity. In the long term, such an accelerator could be used near quantum computing hardware where traditional computers cannot operate, enabling quantum control at a reasonable cost at low latencies without incurring large data bandwidths outside of the cryogenic environment.


Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets

arXiv.org Artificial Intelligence

We propose a novel 2-stage sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model. For the 1st-stage, we adapt a recently proposed quantization technique using a non-linear transformation with tanh(.) on dense layer weights. In the 2nd-stage, we use linear quantization methods on the rest of the network, including other parameters (bias, gain, batchnorm), inputs, and activations. We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data (evaluating on 4,000 hours of data). We organize our results in two embedded chipset settings: a) with commodity ARM NEON instruction set and 8-bit containers, we present accuracy, CPU, and memory results using sub 8-bit weights (4, 5, 8-bit) and 8-bit quantization of rest of the network; b) with off-the-shelf neural network accelerators, for a range of weight bit widths (1 and 5-bit), while presenting accuracy results, we project reduction in memory utilization. In both configurations, our results show that the proposed algorithm can achieve: a) parity with a full floating point model's operating point on a detection error tradeoff (DET) curve in terms of false detection rate (FDR) at false rejection rate (FRR); b) significant reduction in compute and memory, yielding up to 3 times improvement in CPU consumption and more than 4 times improvement in memory consumption.


High-level Modeling of Manufacturing Faults in Deep Neural Network Accelerators

arXiv.org Machine Learning

The advent of data-driven real-time applications requires the implementation of Deep Neural Networks (DNNs) on Machine Learning accelerators. Google's Tensor Processing Unit (TPU) is one such neural network accelerator that uses systolic array-based matrix multiplication hardware for computation in its crux. Manufacturing faults at any state element of the matrix multiplication unit can cause unexpected errors in these inference networks. In this paper, we propose a formal model of permanent faults and their propagation in a TPU using the Discrete-Time Markov Chain (DTMC) formalism. The proposed model is analyzed using the probabilistic model checking technique to reason about the likelihood of faulty outputs. The obtained quantitative results show that the classification accuracy is sensitive to the type of permanent faults as well as their location, bit position and the number of layers in the neural network. The conclusions from our theoretical model have been validated using experiments on a digit recognition-based DNN.


EETimes - Deep Learning on MCUs is the Future of Edge Computing

#artificialintelligence

Just a few years ago, it was assumed that machine learning (ML) -- and even deep learning (DL) -- could only be performed on high-end hardware, with training and inference at the edge executed by gateways, edge servers, or data centers. It was a valid assumption at the time because the trend toward distributing computational resources between the cloud and the edge was in its early stages. But this scenario has changed dramatically thanks to intensive research and development efforts made by industry and academia. The result is that today, processors capable of delivering many trillions of operations per second (TOPS) are not required to perform ML. In an increasing number of cases, the latest microcontrollers, some with embedded ML accelerators, can bring ML to edge devices.